International Journal of Epidemiology — Latest Matching Preprints

1

Design and quality control of large-scale two-sample Mendelian randomisation studies

Fatty Acids in Cancer Mendelian Randomization Collaboration; Haycock, P. C.; Borges, M. C.; Burrows, K.; Lemaitre, R. N.; Harrison, S.; Burgess, S.; Chang, X.; Westra, J.; Khankari, N. K.; Tsilidis, K. K.; Gaunt, T.; Hemani, G.; Zheng, J.; Truong, T.; O'Mara, T.; Spurdle, A. B.; Law, M. H.; Slager, S.; Birmann, B.; Hosnijeh, F. S.; Mariosa, D.; Amos, C. I.; Hung, R. J.; Zheng, W.; Gunter, M. J.; Davey Smith, G.; Relton, C.; Martin, R. M.

2021-08-01 epidemiology 10.1101/2021.07.30.21260578 medRxiv

Top 0.1%

60.4%

Background: Mendelian randomization studies are susceptible to metadata errors (e.g. incorrect specification of the effect allele column) and other analytical issues that can introduce substantial bias into analyses. We developed a quality control (QC) pipeline for the Fatty Acids in Cancer Mendelian Randomization Collaboration (FAMRC) that can be used to identify and correct such errors. Methods: We invited cancer GWAS to share summary association statistics with the FAMRC and subjected the collated data to a comprehensive QC pipeline. We identified metadata errors through comparison of study-specific statistics to external reference datasets (the NHGRI-EBI GWAS Catalog and 1000 Genomes superpopulations) and other analytical issues through comparison of reported to expected genetic effect sizes. Comparisons were based on three sets of genetic variants: 1) GWAS hits for fatty acids, 2) GWAS hits for cancer and 3) a 1000 Genomes reference set. Results: We collated summary data from six fatty acid and 49 cancer GWAS. Metadata errors and analytical issues with the potential to introduce substantial bias were identified in seven studies (13%). After resolving analytical issues and excluding unreliable data, we created a dataset of 219,842 genetic associations with 87 cancer types. Conclusion: In this large MR collaboration, 13% of included studies were affected by a substantial metadata error or analytical issue. By increasing the integrity of collated summary data prior to their analysis, our protocol can be used to increase the reliability of post-GWAS analyses. Our pipeline is available to other researchers via the CheckSumStats package (https://github.com/MRCIEU/CheckSumStats).
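One way to picture the effect-allele check described in this abstract is a comparison of reported effect allele frequencies against a reference panel. The sketch below is only an illustration under stated assumptions: the function names, the toy frequencies, and the zero-correlation cut-off are hypothetical, and it does not reproduce the CheckSumStats package (an R package) or the authors' actual procedure. The idea is that if the effect allele column is mislabelled, the reported frequencies become roughly one minus the reference frequencies, so their correlation with the reference turns negative.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def effect_allele_looks_swapped(study_eaf, reference_eaf, threshold=0.0):
    """Flag a study whose reported effect allele frequencies run against
    the reference panel. The threshold of 0.0 is an assumed cut-off."""
    return pearson(study_eaf, reference_eaf) < threshold


# Hypothetical frequencies for five variants (not real data).
reference = [0.10, 0.25, 0.40, 0.70, 0.90]
correct = [0.12, 0.22, 0.43, 0.68, 0.88]
swapped = [1 - f for f in correct]  # what a mislabelled column would report

print(effect_allele_looks_swapped(correct, reference))
print(effect_allele_looks_swapped(swapped, reference))
```

A real pipeline would also need to handle palindromic variants and ancestry-matched reference frequencies, which this sketch leaves out.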

2

Unintended Pregnancy and Preterm Birth in the United States: Causal Inference and Risk Prediction Using National Survey of Family Growth Data

Adetunji, S. A.; Nerandra, P. M.

2025-11-27 sexual and reproductive health 10.1101/2025.11.25.25341033 medRxiv

Top 0.1%

42.6%

Background: Unintended pregnancy remains common in high-income countries and has been linked to poorer maternal and neonatal outcomes. Whether pregnancy intention has an independent, causal effect on preterm birth, beyond social and clinical risk factors, is uncertain. Methods: We conducted a cross-sectional analysis of a nationally representative sample of singleton live births from a US reproductive health survey. Pregnancy intention (intended vs unintended) was reported at conception. Preterm birth was defined as delivery before 37 completed weeks. We used survey-weighted logistic regression and a suite of causal estimators, including inverse probability weighted marginal structural models, augmented inverse probability weighting, targeted maximum likelihood estimation with Super Learner, Bayesian g-computation, and causal forests. Models adjusted for maternal age, race and ethnicity, parity, marital or cohabiting status, education, poverty ratio, insurance, and body mass index. We also trained Super Learner prediction models with 10-fold cross-validation and evaluated discrimination, calibration, high-risk stratification, and net clinical benefit. Findings: In the weighted population, 39.1% of pregnancies were unintended. Preterm birth occurred in 12.6% of unintended vs 9.0% of intended pregnancies. In survey-weighted logistic models, unintended pregnancy was associated with higher odds of preterm birth (adjusted odds ratio 1.43, 95% CI 1.06 to 1.94). Across advanced causal estimators, the risk difference for unintended vs intended pregnancy was small but consistent, around 3 excess preterm births per 100 live births, with limited positivity and modest E-values suggesting that unmeasured confounding could attenuate or explain part of the association.
A Super Learner ensemble achieved excellent discrimination (area under the curve about 0.98 vs 0.56 for baseline logistic regression), good calibration, and identified a top 10% risk stratum with markedly higher observed preterm birth risk than the lower 90%. Interpretation: In this national sample, unintended pregnancy functioned primarily as a marker of concentrated social and clinical vulnerability rather than a large, isolated causal driver of preterm birth. Nonetheless, pregnancy intention materially improved risk stratification when combined with standard covariates. Joint use of causal inference and machine learning provides a defensible framework to target intensified antenatal support to women at highest risk while avoiding overinterpretation of intention as a deterministic cause. Funding: No external funding.
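To make the reported proportions and the mention of E-values concrete, the following sketch computes the crude risk difference and crude risk ratio from the two preterm birth percentages quoted in the abstract (12.6% and 9.0%), then applies the standard E-value formula for a risk ratio above 1, E = RR + sqrt(RR × (RR − 1)). This is an illustration only: the abstract does not report the authors' E-values, their estimates were covariate-adjusted rather than crude, and the function names are hypothetical.

```python
import math


def risk_difference_per_100(risk_exposed, risk_unexposed):
    """Excess events per 100 births; inputs are proportions in [0, 1]."""
    return (risk_exposed - risk_unexposed) * 100


def e_value(risk_ratio):
    """Standard E-value for a risk ratio; ratios below 1 are inverted first."""
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))


risk_unintended = 0.126  # crude preterm birth risk quoted in the abstract
risk_intended = 0.090

crude_rd = risk_difference_per_100(risk_unintended, risk_intended)
crude_rr = risk_unintended / risk_intended
crude_e = e_value(crude_rr)

print(f"crude risk difference per 100: {crude_rd:.1f}")
print(f"crude risk ratio: {crude_rr:.2f}")
print(f"E-value for the crude risk ratio: {crude_e:.2f}")
```

An E-value close to 2 means an unmeasured confounder associated with both exposure and outcome by a risk ratio of about 2 could fully explain the crude association, which is in line with the abstract's description of the E-values as modest.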

3

Multi-omics Analysis of Umbilical Cord Hematopoietic Stem Cells from a Multi-ethnic Cohort of Hawaii Reveals the Transgenerational Effect of Maternal Pre-Pregnancy Obesity

Du, Y.; Benny, P. A.; Shao, Y.; Schlueter, R. J.; Gurary, A.; Lum-Jones, A.; Lassiter, C. B.; AlAkwaa, F. M.; Tiirikainen, M.; Towner, D.; Ward, W. S.; Garmire, L. X.

2024-08-13 sexual and reproductive health 10.1101/2024.07.27.24310936 medRxiv

Top 0.1%

41.9%

Background: Maternal obesity is a health concern that may predispose newborns to a high risk of medical problems later in life. To understand the intergenerational effect of maternal obesity, we hypothesized that the maternal obesity effect is mediated by epigenetic changes in the CD34+/CD38-/Lin- hematopoietic stem cells (uHSCs) in the offspring. To test this, we conducted a DNA methylation-centric multi-omics study. We measured DNA methylation and gene expression in the CD34+/CD38-/Lin- uHSCs and metabolomics of the cord blood, all from a multi-ethnic cohort (n=72) from Kapiolani Medical Center for Women and Children in Honolulu, Hawaii (collected between 2016 and 2018). Results: Differential methylation (DM) analysis unveiled a global hypermethylation pattern in the maternal pre-pregnancy obese group (BH-adjusted p<0.05), after adjusting for major clinical confounders. KEGG pathway enrichment, WGCNA, and PPI analyses revealed that hypermethylated CpG sites were involved in critical biological processes, including cell cycle, protein synthesis, immune signaling, and lipid metabolism. Applying Shannon entropy to uHSC methylation, we found notably higher quiescence of uHSCs impacted by maternal obesity. Additionally, the integration of multi-omics data (methylation, gene expression, and metabolomics) provided further evidence of dysfunctions in adipogenesis, erythropoietin production, cell differentiation, and DNA repair, aligning with the findings at the epigenetic level. Furthermore, we trained a random forest classifier using the CpG sites in the genes of the top pathways associated with maternal obesity, and applied it to predict cancer vs. adjacent normal labels from samples in 14 Cancer Genome Atlas (TCGA) cancer types. Five of 14 cancers showed balanced accuracy of 0.6 or higher: LUSC (0.87), PAAD (0.83), KIRC (0.71), KIRP (0.63) and BRCA (0.60).
Conclusions: This study revealed a significant correlation between pre-pregnancy maternal obesity and multi-omics level molecular changes in the uHSCs of offspring, particularly in DNA methylation. Moreover, these maternal obesity epigenetic markers in uHSCs may predispose offspring to higher risks in certain cancers.

4

Polygenic Prediction of Type 2 Diabetes in Continental Africa

Chikowore, T.; Ekoru, K.; Vujkovic, M.; Gill, D.; Pirie, F.; Young, E.; Sandhu, M.; McCarthy, M.; Rotimi, C.; Adeyemo, A.; Motola, A.; Fatumo, S.

2021-02-12 genetics 10.1101/2021.02.11.430719 medRxiv

Top 0.1%

41.2%

Objective: Polygenic prediction of type 2 diabetes in continental Africans is adversely affected by the limited number of genome-wide association studies (GWAS) of type 2 diabetes from Africa, and the poor transferability of European-derived polygenic risk scores (PRS) in diverse ethnicities. We set out to evaluate whether African American or multi-ethnic derived PRSs would improve polygenic prediction in continental Africans. Research Design and Methods: Using the PRSice software, ethnic-specific PRSs were computed with weights from the type 2 diabetes GWAS of the Million Veteran Program (MVP) study. The South African Zulu study (1602 cases and 976 controls) was used as the target data set. Replication and assessment of the association of the best predictive PRS with age at diagnosis were done in the Africa America Diabetes Mellitus (AADM) study (1031 cases and 738 controls). Results: The African American derived PRS was more predictive of type 2 diabetes than the European and multi-ethnic derived scores. Notably, participants in the 10th decile of this PRS had a 3.19-fold greater risk (OR 3.19; 95% CI 1.94-5.29; p = 5.33 × 10^-6) of developing diabetes and were diagnosed 2.6 years earlier compared to those in the first decile. Conclusions: African American derived PRS enhances polygenic prediction of type 2 diabetes in continental Africans. Improved representation of non-European populations (including Africans) in GWAS promises to provide better tools for precision medicine interventions in type 2 diabetes.
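For readers unfamiliar with how a polygenic risk score is formed, the sketch below shows the basic definition: a weighted sum of effect-allele dosages, with GWAS effect sizes as weights, followed by ranking individuals on the resulting score. All weights, dosages, and identifiers are hypothetical, and the sketch deliberately omits the clumping and p-value thresholding that tools such as PRSice perform; it is not a reproduction of the authors' analysis.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant weights (e.g. log odds ratios from a GWAS).
weights = [0.10, -0.05, 0.20]

# Hypothetical individuals with dosages for the same three variants.
people = {
    "a": [0, 2, 1],
    "b": [2, 0, 2],
    "c": [1, 1, 0],
}

scores = {pid: polygenic_score(d, weights) for pid, d in people.items()}

# Rank from lowest to highest score; with enough individuals the ranked
# list would be cut into ten equal groups to form deciles.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

Comparing the top and bottom groups of such a ranking is what a statement like "10th decile vs first decile" refers to.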

5

Cohort Profile: Swiss Personalized Health Network Cohort Consortium

Bochud, M.; Tiali, S. E.; Armida, J.; Wissa, R.; Österle, S.; Blanco, J. M.; Ghobril, J. P.; Henchoz, Y.; Pittet, V.; Ponte, B.; Benkert, P.; Kuhle, J.; Castelao, E.; Preisig, M.; Chizzolini, C.; Günthard, H. F.; Kusejko, K.; Imboden, M.; Probst-Hensch, N.; Koller, M.; Marques-Vidal, P.; Vollenweider, P.; Pruijm, M.; Rauch, A.; Ribi, C.; Scherer, A.; Tellenbach, C.; Vaucher, J.; Fortier, I.

2025-10-13 epidemiology 10.1101/2025.10.10.25337504 medRxiv

Top 0.1%

40.4%

Background: Swiss cohort studies provide high-quality longitudinal data, but finding and comparing relevant studies across cohorts has historically been challenging. The Swiss Personalized Health Network Cohort Consortium (SPHN-CC) was established to address these limitations by creating the first coordinated network of Swiss cohort studies within the internationally recognized Maelstrom Research catalogue. Methods: Participating cohorts were invited in 2021-2022, including longitudinal and cross-sectional studies with 1010 to 21 993 participants. Data collected include questionnaires, physical and cognitive assessments, administrative records, and biological samples. Variables were classified into 18 domains and 134 subdomains, and an online metadata catalogue was implemented to document study designs, explore variable content, and assess harmonization potential. Results: The catalogue enables researchers to identify study-specific and harmonized variables for co-analysis. Core variables, such as age, sex/gender, anthropometrics, and medication use, are widely available, while other variables vary across cohorts. Harmonization assessments demonstrate that several key variables can be co-analyzed across multiple studies, supporting collaborative research with over 37 000 participants. A use case illustrates the potential for harmonizing and co-analyzing data across studies. Conclusions: The SPHN-CC strengthens Swiss cohort research by enhancing data discoverability, supporting harmonization, and facilitating cross-cohort and international research, providing a model for more efficient use of high-value longitudinal data.
Key features: 1) The Swiss Personalized Health Network Cohort Consortium aims to optimize the use of data and biological samples collected by publicly funded Swiss cohort studies. 2) To date, 10 studies have participated in the initiative. From 1988 to 2020 they together recruited over 50 000 participants. Recruitment remains active for six of the studies. Most cohorts are still collecting data and biological samples. 3) All studies collected information from questionnaires, nine also collected biospecimens, seven performed physical measurements, two conducted cognitive assessments and two retrieved information from administrative databases at least once during the life course of the study. 4) An online study and variables catalogue was developed to help researchers determine whether the data collected might answer their specific research questions and, if relevant, whether these data may be harmonized and co-analyzed across studies. 5) Access to the metadata catalogue is open and free.

6

Data Resource Profile: Linking electronic health and social records to study and lower health inequalities in cardiovascular diseases (BIG-HEART)

Loo, L.; Umov, N.; Oja, M.; Reisberg, S.; Uuskula, A.; Kolde, R.; Tillmann, T.

2025-05-11 epidemiology 10.1101/2025.05.09.25327142 medRxiv

Top 0.1%

40.0%

The BIG-HEART cohort was established to study and reduce health inequalities in cardiovascular disease by linking rich, multidimensional electronic health and social data across Estonia. The dataset includes all individuals aged 36 and above residing in Estonia in 2012 (N = 770,323). Its full population coverage minimises sampling and healthy volunteer bias. Existing funding and permits will support annual health outcome follow-up through at least 2026, with possible future extensions. The dataset integrates all routinely collected individual-level primary and secondary care health data (including in- and outpatient visits, diagnoses, prescriptions issued and filled), mortality data, and extensive social data (e.g. ethnicity, education, marital status, social benefits, unemployment history, land and business ownership) from eight national registries. This enables exploration of novel social epidemiology dimensions (such as unbiased wealth measures, medication adherence, and care quality) and supports development of equity-enhancing clinical risk prediction algorithms and large language models. Health and social data are linked using pseudonymised identifiers derived from national personal identification numbers, ensuring accuracy and privacy. The data are stored in the OMOP common data model, facilitating international collaboration. Collaboration inquiries are welcome and can be directed to the BIG-HEART team at taavi.tillmann@ut.ee.

7

Mendelian randomization infers the effect of 14 parental illnesses on 44 congenital anomalies

Li, Y.; Shao, W.; Tian, T.; Tan, L.

2024-07-14 sexual and reproductive health 10.1101/2024.07.13.24310358 medRxiv

Top 0.1%

38.7%

Background: Congenital anomalies (CA), including congenital malformations (CM) and congenital deformities (CD), are significant health concerns influenced by genetic and environmental factors. Parental illnesses, especially those with genetic components, may affect the risk of congenital anomalies in offspring. Although clinical studies have suggested associations between certain parental illnesses and increased CM and CD risk, causal relationships remain unclear. This study employs a Mendelian randomization (MR) approach to investigate these potential causal links. Methods: Fourteen parental illnesses were selected for this study: breast cancer, chronic bronchitis/emphysema, diabetes, heart disease, hypertension, and Alzheimer's disease in mothers; and Alzheimer's disease, bowel cancer, chronic bronchitis/emphysema, diabetes, heart disease, hypertension, lung cancer, and prostate cancer in fathers. Genetic variants associated with these illnesses were identified from genome-wide association studies (GWAS) in the UK Biobank. Genetic data for 44 congenital anomalies were sourced from the FinnGen database. Two-sample MR was conducted to estimate causal effects, with sensitivity analyses and multivariable MR (MVMR) to control for potential confounders. Results: MR analysis revealed causal relationships between 13 parental illnesses and 13 specific congenital anomalies. Notably, mothers' hypertension significantly increased the risk of congenital hypothyroidism (IVW: OR = 7.969, 95% CI = 3.0826-20.6011, p = 4.20E-04), and fathers' diabetes increased the risk of congenital heart defects in offspring (IVW: OR = 3.8E+09, 95% CI = 2.2E+04-6.6E+14, p = 3E-04). The strength of the associations varied with the type of parental illness and the specific congenital disease. Conclusion: This study underscores the utility of MR in elucidating genetic influences of parental health conditions on congenital anomalies.
The findings highlight the importance of managing parental health to reduce the risk of congenital anomalies in offspring. Further research is needed to explore underlying biological mechanisms and validate these findings in diverse populations.

8

Breast cancer over-diagnosis due to mammography screening - A long-term follow-up population study of BreastScreen Norway

Heggland, T.; Vatten, L. J.; Opdahl, S.; Weedon-Fekjaer, H.

2026-06-03 epidemiology 10.64898/2026.06.02.26354696 medRxiv

Top 0.1%

38.0%

Objectives: Estimates of breast cancer over-diagnosis related to mammography screening vary substantially. Over-diagnosis is commonly defined as cases that would not have been detected during the person's remaining lifetime in the absence of screening. Here we aim to quantify over-diagnosis in the population-based BreastScreen Norway mammography screening program using long-term follow-up and more detailed modeling than previous studies. Setting: We applied data on Norwegian screening patterns and breast carcinoma incidence for the period 1987-2019, covering women aged 49-84 years, leveraging the gradual implementation of the organized biennial BreastScreen Norway screening program for women aged 50-69 during 1995-2005. Methods: Using an extended age-period-cohort model, we estimated excess lifetime risk of invasive breast cancer and ductal carcinoma in situ in the presence of program screening, as an indicator of over-diagnosis among screen-detected cases. Results: Lifetime risk of breast carcinomas was 6.6% (95% confidence interval 2.5% to 10.7%) higher for invited than for non-invited women. This indicates that 18% (95% confidence interval 7.3% to 28.0%) of screen-detected cases may be over-diagnosed, and that approximately one in 86 (95% confidence interval 54 to 210) screened women were over-diagnosed during their screening period. Using effect estimates from previous studies, we estimated that approximately three women are over-diagnosed for every breast cancer death prevented by screening, and that 87% of over-diagnosed tumors might grow extremely slowly. Conclusions: Over-diagnosis related to mammography screening is a considerable problem, but its extent may be smaller than reported in some previous studies. Most over-diagnosed tumors likely grow very slowly.
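As a reading aid, the "one in 86" figure and its interval can be restated as over-diagnosed women per 1000 screened. The sketch below does only that unit conversion on the numbers quoted in the abstract; the helper name is hypothetical and nothing here reproduces the authors' age-period-cohort model. Note that the interval bounds swap places on conversion, because a larger "one in N" corresponds to a smaller rate.

```python
def per_thousand(one_in_n):
    """Convert a 'one in N' frequency into events per 1000 people."""
    return 1000.0 / one_in_n


point = per_thousand(86)   # point estimate quoted in the abstract
low = per_thousand(210)    # upper 'one in N' bound gives the lower rate
high = per_thousand(54)    # lower 'one in N' bound gives the upper rate

print(f"{point:.1f} per 1000 screened (interval {low:.1f} to {high:.1f})")
```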

9

Computation of Socio-Economic Status in the AWI-Gen Project

Hazelhurst, S.; Boua, P. R.; Choudhury, A.; Madala, S.; Sengupta, D.; Tluway, F.; Ramsay, M.

2024-08-22 epidemiology 10.1101/2024.08.22.24312411 medRxiv

Top 0.1%

35.0%

Socio-economic status of participants in many public health, epidemiological, and genome-wide association studies is an important trait of interest. It is often used in these studies as a measure of direct interest or as a covariate. The Africa Wits INDEPTH Partnership for Genomic and Environmental Research (AWI-Gen) explores genomic and environmental factors in non-communicable diseases, particularly cardio-metabolic disease. In Phase I of AWI-Gen, approximately 12,000 participants were recruited at six sites in four African countries. Participants were asked questions about asset ownership. This technical note describes how AWI-Gen computed socio-economic status from the asset register.

10

Age-dependent effects of adiposity on asthma risk during childhood and adulthood: a lifecourse Mendelian randomization study

Urquijo, H.; Leyden, G. M.; Davey Smith, G.; Richardson, T. G.

2023-08-13 epidemiology 10.1101/2023.08.08.23293842 medRxiv

Top 0.1%

34.6%

Background: Separating the direct and long-term consequences of childhood lifestyle factors on asthma risk can be exceptionally challenging in epidemiology given that cases are typically diagnosed at various timepoints throughout the lifecourse. Methods: In this study, we used human genetic data to evaluate the effects of childhood and adulthood adiposity on risk of pediatric (n=13,962 cases) and adult-onset asthma (n=26,582 cases) with a common set of controls (n=300,671) using a technique known as lifecourse Mendelian randomization. Findings: We found that childhood adiposity increases risk of pediatric asthma (OR=1.20, 95% CI=1.03 to 1.37, P=0.03), whereas there was weak evidence that it has a long-term influence on adult-onset asthma (OR=1.05, 95% CI=0.93 to 1.17, P=0.39). Conversely, there was strong evidence that adulthood adiposity increases asthma risk in midlife using our lifecourse approach (OR=1.37, 95% CI=1.28 to 1.46, P=7 × 10^-12). Interpretation: These findings suggest that adiposity in childhood and adulthood are independent risk factors for asthma at each of their corresponding timepoints in the lifecourse. This inference would not be possible without the application of human genetic data, emphasizing the value of this approach in uncovering risk factors that begin to exert their influence on disease at early stages in life. Funding: The Medical Research Council and the British Heart Foundation.

11

Biologically informed instrument selection for dietary Mendelian randomization using chemosensory receptor variants

Hwang, L.-D.; Lin, C.; Evans, D. M.; Martin, N. G.; Reed, D. R.; Joseph, P. V.

2026-02-06 epidemiology 10.64898/2026.02.05.26345702 medRxiv

Top 0.1%

33.9%

Background: Mendelian randomization (MR) is increasingly used for causal inference in nutritional epidemiology; however, dietary MR studies often rely on instruments statistically selected from genome-wide association studies of self-reported intake, which are vulnerable to pleiotropy and reverse causation and may violate core MR assumptions. We aimed to develop and evaluate a biologically informed framework for selecting valid genetic instruments for dietary exposures, based on genes encoding taste and olfactory receptors that mediate chemosensory inputs and shape food preferences and dietary behaviour. Methods: We prioritised 1,214 nonsynonymous variants in 30 taste and 295 olfactory receptor genes with minor allele frequency ≥1%. Associations with 140 food-liking traits were tested in UK Biobank participants aged 37 to 73 years. Candidate variants were evaluated using a multi-stage filtering pipeline designed to improve instrument validity. This included replication in an independent younger cohort (Avon Longitudinal Study of Parents and Children, age 25), concordance between food liking and intake, exclusion of associations with socioeconomic status, assessment of food specificity accounting for linkage disequilibrium and co-consumption patterns, and directionality testing to reduce reverse causation. Retained variants were applied as instruments in MR analyses to assess cardiometabolic outcomes. Results: We identified 268 nonsynonymous variants within 101 olfactory and 16 taste receptor genes associated with 96 food-liking traits. The filtering process yielded 28 candidate instruments for 24 foods. Among these, the instrument for onion liking uniquely satisfied all criteria for classification as high confidence. To demonstrate clinical relevance, genetically proxied onion liking was associated with lower blood pressure and a reduced risk of type 2 diabetes in MR analyses, with no evidence of effects on body mass index, glycaemic traits, or serum lipid levels.
Conclusions: Guiding genetic instrument selection using chemosensory receptor genes provides a biologically informed strategy for dietary Mendelian randomization that reduces susceptibility to pleiotropy and reverse causation. This framework enables more robust causal evaluation of diet-disease relationships and strengthens inference in nutritional epidemiology and public health research.

12

Trans-ancestry polygenic models for the prediction of LDL blood levels: An analysis of the UK Biobank and Taiwan Biobank

Hassanin, E.; Lee, K.-H.; Hsieh, T.-C.; Aldisi, R.; Lee, Y.-L.; Bobbili, D.; Krawitz, P.; May, P.; Chen, C.-Y.; Maj, C.

2023-08-06 genetic and genomic medicine 10.1101/2023.08.03.23293320 medRxiv

Top 0.1%

33.5%

Background: Polygenic risk scores (PRSs) are proposed for use in clinical and research settings for risk stratification. PRS predictions often show bias toward the population of available genome-wide association studies, which is typically of European ancestry. This study aims to assess the performance differences of ancestry-specific PRS and test the implementation of multi-ancestry PRS to enhance the generalizability of low-density lipoprotein (LDL) cholesterol predictions in the East Asian population. Methods: We computed ancestry-specific and multi-ancestry PRS for LDL using data from the global lipid consortium while accounting for population-specific linkage disequilibrium patterns using the PRS-CSx method. We first conducted an ancestry-wide analysis using the UK Biobank dataset (n=423,596) and then applied the same models to the Taiwan Biobank dataset (TWB, n=68,978). PRS performance was assessed using linear regression with adjustment for age, sex, and principal components. PRS strata were considered to assess the extent to which a PRS categorization can stratify individuals for LDL cholesterol levels in East Asian samples. Results: Population-specific PRS better predicted LDL levels within the target population but multi-ancestry PRS were more generalizable. In the TWB dataset, covariate-adjusted R² values were 9.3% for ancestry-specific PRS, 6.7% for multi-ancestry PRS, and 4.5% for European-specific PRS. Similar trends (8.6%, 7.8%, 6.2%) were observed in the smaller East Asian population of the UK Biobank (n=1,480). Consistent with the R² values, PRS stratification in East Asians (TWB) effectively captured heterogeneous variability in LDL blood cholesterol levels across PRS strata. The mean difference in LDL levels between the lowest and highest East Asian-specific PRS (EAS_PRS) deciles was 0.82, compared to 0.59 for European-specific PRS (EUR_PRS) and 0.76 for multi-ancestry PRS.
Notably, the mean LDL values in the top decile of multi-ancestry PRS were comparable to those of EAS_PRS (3.543 vs. 3.541, P=0.86). Conclusions: Our analysis of the PRS prediction model for LDL cholesterol further supports the issue of PRS generalizability across populations. Our targeted analysis of the East Asian (EAS) population revealed that integrating non-European genotyping data, accounting for population-specific linkage disequilibrium, and considering meta-analyses of non-European-based GWAS alongside powerful European-based GWAS can enhance the generalizability of LDL PRS.

13

Validity and Interpretation of Two-Sample Mendelian Randomization with Binary Traits

Wu, Z.; Wang, J.

2026-02-18 genetics 10.1101/2024.06.09.598150 medRxiv

Top 0.1%

33.4%

Background: Two-sample Mendelian randomization (MR) is widely applied to binary exposures and outcomes. Yet standard MR models rely on linear effect assumptions that are difficult to interpret for binary traits. Although liability-based interpretations have been suggested, it remains unclear whether conventional summary-data MR is formally justified in this setting or what causal parameter it identifies. Methods: We develop a liability-threshold framework in which binary traits arise from underlying continuous liabilities. We derive explicit relationships between genome-wide association study (GWAS) coefficients obtained from logistic or linear regression on binary traits and marginal genetic associations on the liability scale. Under small genetic effects, typical for complex traits, observed-scale GWAS coefficients are approximately proportional to liability-scale associations. Results: This proportionality implies that standard two-sample MR methods remain statistically coherent for binary traits. MR applied to binary exposures or outcomes estimates a scaled causal effect between underlying liabilities rather than an effect on the observed binary scale. The scaling factor depends primarily on trait prevalence and is directly computable. Simulations and UK Biobank analyses confirm that, after rescaling, MR using binary traits recovers liability-scale causal effects consistent with analyses based on continuous traits. Conclusions: We provide a formal statistical justification for summary-data MR with binary traits and clarify the causal parameter being estimated. These results support routine MR practice for binary exposures and outcomes while enabling coherent interpretation of effect sizes.
Key messages: 1) The interpretation of two-sample MR with binary exposures or outcomes is often unclear because GWAS analyses are performed on the observed binary scale. 2) Under a liability threshold framework with small genetic effects, GWAS coefficients from logistic or linear regression on binary traits are approximately proportional to genetic associations on an underlying continuous liability scale. 3) Consequently, conventional summary-data MR applied to binary or ordinal traits remains valid and estimates a scaled causal effect between liabilities, requiring no modification of existing methods.
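The claim that the scaling factor depends mainly on prevalence can be illustrated with the textbook first-order liability-threshold approximation. Assuming a standard normal liability and a threshold t chosen so that the prevalence is K, a small liability-scale effect β shifts the probability of being a case by about φ(t)·β, so a linear-regression coefficient on the binary trait is roughly φ(t)·β and a logistic-regression coefficient (log odds) is roughly φ(t)·β / (K(1 − K)). The sketch below computes these two factors; it is a generic approximation with hypothetical function names, and the abstract does not state whether the authors' scaling factor takes exactly this form.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # liability assumed standard normal


def threshold_density(prevalence):
    """Normal density at the liability threshold implied by the prevalence."""
    threshold = STD_NORMAL.inv_cdf(1 - prevalence)
    return STD_NORMAL.pdf(threshold)


def linear_scale_factor(prevalence):
    """Approximate ratio: observed-scale linear beta / liability-scale beta."""
    return threshold_density(prevalence)


def logistic_scale_factor(prevalence):
    """Approximate ratio: log-odds beta / liability-scale beta."""
    return threshold_density(prevalence) / (prevalence * (1 - prevalence))


for k in (0.01, 0.10, 0.50):
    print(k, round(linear_scale_factor(k), 4), round(logistic_scale_factor(k), 4))
```

Dividing an observed-scale GWAS coefficient by the matching factor gives an approximate liability-scale coefficient, which is the kind of rescaling the abstract refers to; the approximation degrades for large effects.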

14

Mortality from COVID-19 in 12 countries and 6 states of the United States

Brown, P.; Jha, P.; CGHR COVID Mortality Consortium

2020-04-22 epidemiology 10.1101/2020.04.17.20069161 medRxiv

Top 0.1%

33.0%

Importance: Reliable estimates of COVID-19 mortality are crucial to aid control strategies and to assess the effectiveness of interventions. Objective: Project COVID-19 mortality trends to October 1, 2020, in 12 countries or regions that constitute >90% of the global COVID-19 deaths reported as of April 12, 2020. Design, Setting, and Participants: The Global COVID-19 Assessment of Mortality (GCAM) is an open, transparent, and continuously updated (www.cghr.org/covid) statistical model that combines actual COVID-19 mortality counts with Bayesian inference to forecast COVID-19 deaths, the date of peak deaths, and the duration of excess mortality. The analyses covered a total population of 700 million above age 20 in 12 countries or regions: USA; Italy; Spain; France; UK; Iran; Belgium; a province of China (Hubei, which accounted for 90% of reported Chinese deaths); Germany; the Netherlands; Switzerland; and Canada; and six US states: New York, New Jersey, Michigan, Louisiana, California, and Washington. Results: Forecasted deaths across the 12 current high-burden countries sum to 167,000 to 593,000 (median 253,000). The trajectory of US deaths (49,000-249,000 deaths; median 86,000), over half of which are expected in states beyond the initial six states analysed in this study, will have the greatest impact on the eventual total. Mortality ranges are 25,000-109,000 (median 46,000) in the UK; 23,000-31,000 (median 26,000) in Italy; 21,000-37,000 (median 26,000) in France and 21,000-32,000 (median 25,000) in Spain. Estimates are most precise for Hubei, China, where the epidemic curve is complete, and least precise in California, where it is ongoing. New York has the highest cumulative median mortality rate per million (1135), about 12-fold that of Germany. Mortality trajectories are notably flatter in Germany, California, and Washington State, each of which took physical distancing and testing strategies seriously.
Using past country-specific mortality as a guide, GCAM predicts surge capacity needs, reaching more than twice existing capacity in a number of places., In every setting, the results might be sensitive to undercounts of COVID-19 deaths, which are already apparent. Conclusion and RelevanceMortality from COVID-19 will be substantial across many settings, even in the best case scenario. GCAM will provide continually updated and increasingly precise estimates as the pandemic progresses. The coronavirus disease (COVID-19) pandemic has already caused over 115,000 deaths, with global deaths doubling every week.1-3 Mortality is less biased than case reporting, which is affected by testing policies. However, the daily reporting of COVID-19 deaths is already known to undercount actual deaths, varying over time and place.4-6 Reliable estimates of total COVID-19 mortality, the date of peak deaths, and of the duration of excess mortality are crucial to aid responses to the current and potential future pandemics. We have developed the Global COVID-19 Assessment of Mortality (GCAM), a statistical model to project COVID-19 mortality trends to October 1 2020 in 12 countries or regions that constitute >90% of the global COVID-19 deaths reported as of April 12th. We report also on six US states that account for 70% of the American totals to date (Supplementary Appendix).1 We quantify the COVID-19 mortality trajectory ranges in each setting. A semi-automated website (www.cghr.org/covid) provides daily updates. GCAM is open, transparent, and uses a reasonably simple method that employs publicly reported mortality data to make plausible projections. The method is designed to improve as more mortality data become available over longer time periods.

15

Polygenic and clinical risk scores and their impact on age at onset of cardiometabolic diseases and common cancers

Mars, N. J.; Koskela, J. T.; Ripatti, P.; Kiiskinen, T. T. J.; Havulinna, A. S.; Lindbohm, J. V.; Ahola-Olli, A.; Kurki, M.; Karjalainen, J.; Palta, P.; FinnGen, ; Neale, B. M.; Daly, M.; Salomaa, V.; Palotie, A.; Widen, E.; Ripatti, S.

2019-08-06 genomics 10.1101/727057 medRxiv

Top 0.1%

32.7%

Background: Polygenic risk scores (PRS) have shown promise in predicting susceptibility to common diseases. However, the extent to which PRS and clinical risk factors act jointly and identify high-risk individuals for early onset of disease is unknown. Methods: We used large-scale biobank data (the FinnGen study; n=135,300), with up to 46 years of prospective follow-up, and the FINRISK study with standardized clinical risk factor measurements to build genome-wide PRSs with >6M variants for coronary heart disease (CHD), type 2 diabetes (T2D), atrial fibrillation (AF), and breast and prostate cancer. We evaluated their associations with first disease events, age at disease onset, and impact together with routinely used clinical risk scores for predicting future disease. Results: Compared to the 20-80th percentiles, a PRS in the top 2.5% translated into hazard ratios (HRs) for incident disease ranging from 2.03 to 4.28 (p-values 1.96×10^-59 to <1.00×10^-100) and the bottom 2.5% into HRs ranging from 0.20 to 0.61. The estimated difference in age at disease onset between top and bottom 2.5% of PRSs was 6 to 13 years. Among early-onset cases, 21.3-32.9% had a PRS in the highest decile and in CHD and AF. Conclusions: The properties of PRS were similar in all five diseases. PRS identified a considerable proportion of early-onset cases, and for all ages the performance of PRS was comparable to established clinical risk scores. These findings warrant further clinical studies on application of polygenic risk information for stratified screening or for guiding lifestyle and preventive medical interventions.

16

A systematic review of the reporting and methodological quality of studies that use Mendelian randomisation in UK Biobank

Gibson, M. J.; Spiga, F.; Campbell, A.; Khouja, J. N.; Richmond, R. C.; Munafo, M. R.

2022-04-26 epidemiology 10.1101/2022.04.25.22274252 medRxiv

Top 0.1%

32.6%

Background: Mendelian randomisation (MR) is a method of causal inference that uses genetic variation as an instrumental variable (IV) to account for confounding. While the number of MR articles published each year is rapidly rising (partly due to large cohort studies such as the UK Biobank making it easier to conduct MR), it is not currently known whether these studies are appropriately conducted and reported in enough detail for other researchers to accurately replicate and interpret them. Methods: We conducted a systematic review of reporting and analysis quality of MR studies using only individual-level data from the UK Biobank to calculate a causal estimate. We reviewed 64 eligible articles on a 25-item checklist (based on the STROBE-MR reporting guidelines and the Guidelines for performing Mendelian Randomisation investigations). Information on article type and journal information was also extracted. Results: Overall, the proportion of articles which reported complete information ranged from 2% to 100% across the different items. Palindromic variants, variant replication, missing data, associations between the IV and variables of exposure/outcome and bias introduced by two-sample methods used on a single sample were often not completely addressed (<11%). There was no clear evidence that Journal Impact Factor, word limit/recommendation or year of publication predicted percentage of article completeness (for the eligible analyses) across items, but there was evidence that whether the MR analyses were primary, joint-primary or secondary analyses did predict completeness. Conclusions: The results identify areas in which the reporting and conduct of MR studies need to be improved and highlight that this is independent of Journal Impact Factor, year of publication or word limits/recommendations.

17

Benchmarking Bayesian colocalization methods in validating Mendelian randomization-based target discoveries from circulating proteins for cardiometabolic diseases

Zhang, W.; Yoshiji, S.; Sladek, R.; Dupuis, J.; Lu, T.

2024-10-17 genetics 10.1101/2024.10.14.617627 medRxiv

Top 0.1%

32.4%

Background: Mendelian randomization (MR) is an important tool for identifying potential biomarkers and drug targets. Colocalization analysis is crucial for validating MR findings and guarding against potential confounding due to linkage disequilibrium. We aim to systematically benchmark the performance of four Bayesian colocalization methods in validating MR-based target discoveries from circulating proteins for cardiometabolic diseases. Results: We conducted MR analyses to assess the associations between circulating levels of 1,535 proteins and five cardiometabolic traits, followed by colocalization analyses using coloc, coloc+SuSiE, PWCoCo and SharePro. All methods demonstrated well-controlled false discoveries in the colocalization analysis of 611 pairs of circulating proteins and cardiometabolic traits with a nominal p-value > 0.9 in MR. SharePro demonstrated the highest frequency in supporting 160 (79.6%) of the 201 Bonferroni-significant protein-trait associations identified by MR, compared to coloc (supporting 40.3% of these associations), coloc+SuSiE (46.8%), and PWCoCo (45.8%), and was robust to varying prior colocalization probabilities. Moreover, protein-trait associations identified by MR and supported by SharePro were more likely to agree with significant gene-level associations based on rare variants detected in exome-wide association studies and implicate known drug targets for cardiometabolic diseases. Eight protein-trait associations were exclusively supported by SharePro and did not demonstrate a high risk of horizontal pleiotropy, suggesting potential cardiometabolic biomarkers or drug targets, such as HSF1 and HAVCR2. Conclusions: SharePro most often supports high-confidence associations identified through MR for cardiometabolic diseases. Combining multiple lines of evidence using different methods may substantially increase the yield of biomarker and drug target discovery programs.

18

Explainable AI to predict a complex multifactorial outcome, childhood obesity: Application to clinical epidemiology

Chen, F.; Melton, P.; Vinsen, K.; Mori, T. A.; Beilin, L.; Huang, R.-C.

2025-06-23 epidemiology 10.1101/2025.06.21.25330041 medRxiv

Top 0.1%

28.8%

Background: Childhood obesity, driven by genetic and epidemiological factors, poses significant health risks, yet traditional machine learning models lack interpretability for clinical use. Objective: This study aims to apply Kolmogorov-Arnold Networks (KAN), an explainable machine learning model, to predict body mass index (BMI) at age 8 as an indicator of obesity risk and to develop a publicly accessible prediction tool. Methods: We utilized the Raine Study Gen2 cohort (n=2,868) to train KAN and traditional models (such as Random Forest, Gradient Boosting, Lasso, and Multi-Layer Perceptron) using perinatal, early-life, and polygenic risk score (PGS) data collected before age 5. Feature importance was analyzed across all the models. A publicly accessible online calculator was developed for practical use. Results: KAN achieved an R2 of 0.81, outperforming traditional models. Key predictors included Year 5 BMI z-score, mid-arm circumference, occupation of mother, and PGS. The online calculator supports predictions without PGS, maintaining an R2 of 0.81. Conclusions: KAN's transparent formulas enhance interpretability, offering a practical approach to predicting childhood obesity. The freely accessible tool enables clinicians to implement personalized prevention strategies, advancing precision medicine. Graphical abstract: KAN model predicts childhood obesity (BMI at age 8), showcasing key features, top performance, and accurate formularised results with epidemiological and genetic factors. Online calculator is available at https://bmi-y8-calc.onrender.com/.

19

The relationship of major diseases with childlessness: a sibling matched case-control and population register study in Finland and Sweden

Liu, A.; Akimova, E. T.; Ding, X.; Jukarainen, S.; Vartiainen, P.; Kiiskinen, T.; Kuitunen, S.; Havulinna, A. S.; Gissler, M.; Lombardi, S.; Fall, T.; Mills, M. C.; Ganna, A.

2022-04-02 sexual and reproductive health 10.1101/2022.03.25.22272822 medRxiv

Top 0.1%

28.7%

The percentage of women born 1965-1975 remaining childless is ~20% in many Western European and ~30% in some East Asian countries. Around a quarter of childless women are childless voluntarily, suggesting a remaining role for disease. Single diseases have been linked to childlessness, mostly in women, yet we lack a comprehensive picture of the effect of early-life diseases on lifetime childlessness. We examined all individuals born 1956-1968 (men) and 1956-1973 (women) in Finland (n=1,035,928) and Sweden (n=1,509,092) to completion of reproduction in 2018 (age 45 women; 50 men). Leveraging nationwide registers, we associated sociodemographic and reproductive information with 414 diseases across 16 categories, using a population and matched pair case-control design of siblings discordant for childlessness (71,524 full-sisters, 77,622 full-brothers). The strongest associations were mental-behavioural, particularly amongst men (schizophrenia, acute alcohol intoxication), congenital anomalies and endocrine-nutritional-metabolic disorders (diabetes), strongest amongst women. We identified novel associations for inflammatory (e.g., myocarditis) and autoimmune diseases (e.g., juvenile idiopathic arthritis). Associations were dependent on age at onset, earlier in women (21-25 years) than men (26-30 years). Disease association was mediated by singlehood, especially in men, and by educational level. Evidence can be used to understand how disease contributes to involuntary childlessness.
Text box: Defining Childlessness. We use the term childlessness to describe individuals that have had no live-born children by the end of their reproductive lifespan (age 45 for women; 50 for men). Childlessness is defined in the literature as being both involuntary, related to biology and fecundity (e.g., infertility, inability to find a partner) and voluntary or childfree [1] (e.g., active choice, preference [2]). It has been estimated that 4-5% of the current 15-20% women who are childless in Europe are voluntarily childless [3]. Childless individuals are subjected to discrimination and marginalization in many societies [4], with infertile women globally experiencing multiple types of violence and coercion [5]. A parallel line of work, which is not the position of this paper or authors, is to problematize and stigmatize childless individuals as egoistic and place blame on this group for producing a so-called demographic disaster of shrinking and ageing populations and collapse of social security systems [6]. The approach of this paper is to provide a neutral, data-driven, and factual examination of early-life diseases related to childlessness, with the aim to design a better understanding of health to prevent childlessness among those who want to have children.

20

Widespread environment-specific causal effects detected in the UK Biobank

Knusel, L.; Man, A.; Pare, G.; Kutalik, Z.

2024-08-21 epidemiology 10.1101/2024.08.21.24312360 medRxiv

Top 0.1%

28.4%

Background: Mendelian Randomization (MR) is a widely used tool to infer causal relationships. Yet, little research has been conducted on the elucidation of environment-specific causal effects, despite mounting evidence for the relevance of environmental variables that modify causal effects. Methods: To investigate potential modifications of causal effects, we extended two-stage-least-squares MR to investigate interaction effects (2SLS-I). We first tested 2SLS-I in a wide range of realistic simulation settings including quadratic and environment-dependent causal effects. Next, we applied 2SLS-I to investigate how environmental variables such as age, socioeconomic deprivation, and smoking modulate causal effects of a range of epidemiologically relevant exposures (such as systolic blood pressure, education, and body fat percentage) on outcomes (e.g. forced expiratory volume (FEV1), CRP, and LDL cholesterol) in up to 337,392 individuals of the UK Biobank. Results: In simulations, 2SLS-I yielded unbiased interaction estimates, even in the presence of non-linear causal effects. Applied to real data, 2SLS-I allowed for the detection of 182 interactions (P<0.001), with age, socioeconomic deprivation, and smoking being identified as important modifiers of many clinically relevant causal effects. For example, the positive causal effect of triglycerides on systolic blood pressure was significantly attenuated in the elderly whilst the positive causal effect of gamma-glutamyl transferase on CRP was intensified in smokers. Conclusion: We present 2SLS-I, a method to simultaneously investigate environment-specific and non-linear causal effects. Our results highlight the importance of environmental variables in modifying well-established causal effects.